Using Language Models to Detect Wikipedia Vandalism
Authors
Abstract
This paper explores a statistical language modeling approach for detecting Wikipedia vandalism. Wikipedia is a popular and influential collaborative information system. The collaborative nature of authoring, together with the high visibility of its content, has exposed Wikipedia articles to vandalism, defined as malicious editing intended to compromise the integrity of the content of articles. Extensive manual efforts are being made to combat vandalism, and an automated approach to alleviate this laborious process is essential. This paper first offers a categorization of Wikipedia vandalism types and identifies the technical challenges associated with detecting each category. It then builds statistical language models, constructing distributions of words from the revision history of Wikipedia articles. As vandalism often involves the use of unexpected words to draw attention, the fitness (or lack thereof) of a new edit when compared with language models built from previous versions may well indicate that the edit is an instance of vandalism. The Wikipedia domain, with its revision histories, offers a novel context in which to explore the potential of language models in characterizing author intention. As the experimental results presented in the paper demonstrate, these models hold promise for vandalism detection.
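The idea in the abstract can be illustrated with a minimal sketch: build a smoothed unigram model from an article's past revisions, then score a new edit by its average per-word log-probability under that model. This is an assumption-laden toy, not the paper's exact formulation; the function names, the add-one smoothing, and the sample texts are all hypothetical choices for illustration.

```python
from collections import Counter
import math

def build_unigram_model(revisions):
    """Build an add-one-smoothed unigram language model from past revision texts."""
    counts = Counter()
    for text in revisions:
        counts.update(text.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    def log_prob(word):
        # Laplace smoothing so unseen words still get a small nonzero probability
        return math.log((counts[word] + 1) / (total + vocab + 1))
    return log_prob

def edit_fitness(log_prob, edit_text):
    """Mean per-word log-probability of a new edit under the history model.
    Lower (more negative) values mean the edit uses more unexpected words."""
    words = edit_text.lower().split()
    if not words:
        return 0.0
    return sum(log_prob(w) for w in words) / len(words)

# Hypothetical revision history of one article
history = [
    "the cat is a small domesticated carnivorous mammal",
    "the cat is a domesticated species of small carnivorous mammal",
]
lp = build_unigram_model(history)

# An edit full of unexpected words scores lower fitness than an on-topic edit,
# which is the signal a vandalism detector would threshold on.
assert edit_fitness(lp, "zzzz lol hahaha spam") < edit_fitness(lp, "the cat is a carnivorous mammal")
```

In practice a detector would compare the fitness score against a threshold learned from labeled revisions, and richer models (higher-order n-grams, metadata features) would replace the unigram counts.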
Related papers
Detecting Vandalism on Wikipedia across Multiple Languages
Vandalism, the malicious modification or editing of articles, is a serious problem for free and open access online encyclopedias such as Wikipedia. Over the 13 year lifetime of Wikipedia, editors have identified and repaired vandalism in 1.6% of more than 500 million revisions of over 9 million English articles, but smaller manually inspected sets of revisions for research show vandalism may ap...
Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis
Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism de...
متن کامل"Got You!": Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling
Discriminating vandalism edits from non-vandalism edits in Wikipedia is a challenging task, as ill-intentioned edits can include a variety of content and be expressed in many different forms and styles. Previous studies are limited to rule-based methods and learning based on lexical features, lacking in linguistic analysis. In this paper, we propose a novel Web-based shallow syntactic-semantic m...
Enriching Wikipedia Vandalism Taxonomy via Subclass Discovery
This paper adopts an unsupervised subclass discovery approach to automatically improve the taxonomy of Wikipedia vandalism. Wikipedia vandalism, defined as malicious editing intended to compromise the integrity of the content of articles, exhibits heterogeneous characteristics, making it hard to detect automatically. The categorization of vandalism provides insights on the detection of vandalis...
Divide and Transfer: an Exploration of Segmented Transfer to Detect Wikipedia Vandalism
The paper applies knowledge transfer methods to the problem of Wikipedia vandalism detection, defined as malicious editing intended to compromise the integrity of the content of articles. A major challenge of detecting Wikipedia vandalism is the lack of a large amount of labeled training data. Knowledge transfer addresses this challenge by leveraging previously acquired knowledge from...
Publication date: 2009